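$\mathbf{x} = (x_1, x_2, \ldots, x_d)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_d)$, the Euclidean distance between them is defined as

$$ D_E(\mathbf{x}, \mathbf{y}) = \frac{1}{d} \sum_{i=1}^{d} (x_i - y_i)^2 \tag{2.15} $$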
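As a minimal sketch of Eq. (2.15) in the form given here, the function below averages the squared coordinate differences over the $d$ dimensions. The function name and the example vectors are illustrative assumptions, not part of the source. Note that this form is normalised by $d$ and takes no square root, so its values differ from the more common definition $\sqrt{\sum_i (x_i - y_i)^2}$.

```python
from typing import Sequence


def distance_eq_2_15(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared coordinate difference, following the form of Eq. (2.15)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    d = len(x)
    # Sum the squared differences over all d coordinates, then divide by d.
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / d


# Hypothetical example vectors.
print(distance_eq_2_15([1.0, 2.0, 3.0], [1.0, 4.0, 6.0]))
```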
Hamming distance [Hamming, 1950] is commonly used for
data and it is defined as below,
$$ D_H(\mathbf{x}, \mathbf{y}) = \frac{1}{d} \sum_{i=1}^{d} |x_i - y_i| \tag{2.16} $$
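Eq. (2.16) can be sketched in a few lines. The function name is illustrative, and the 0/1 example vectors are an assumption chosen so that each term $|x_i - y_i|$ is either 0 (match) or 1 (mismatch); the result is then the fraction of positions at which the two vectors differ.

```python
from typing import Sequence


def hamming_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute coordinate difference, following Eq. (2.16)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    d = len(x)
    # For 0/1 vectors this counts mismatching positions and divides by d.
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / d


# Hypothetical 0/1 vectors that differ in two of four positions.
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```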
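A distance measures dissimilarity. A correlation coefficient, by contrast, can be used to measure how similar two data points are to each other. The definition of the correlation coefficient for two vectors (hence two data points) $\mathbf{x}$ and $\mathbf{y}$ is shown below, where $\mu_x$ and $\mu_y$ stand for the population means and $\sigma_{\mathbf{x}}^2$ and $\sigma_{\mathbf{y}}^2$ stand for the variances of the two populations,

$$ \rho(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{d} (x_i - \mu_x)(y_i - \mu_y)}{(d - 1)\sqrt{\sigma_{\mathbf{x}}^2 \sigma_{\mathbf{y}}^2}} \tag{2.17} $$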
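The following is a minimal sketch of Eq. (2.17) using only the Python standard library. It assumes that the means are taken over the $d$ coordinates of each vector and that the variances use the $d - 1$ denominator (as `statistics.variance` does), which is the choice that keeps the result within $[-1, 1]$ given the $(d - 1)$ factor in the equation. The function name is illustrative.

```python
import math
from statistics import mean, variance
from typing import Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Correlation coefficient of two vectors, following Eq. (2.17)."""
    d = len(x)
    if d != len(y) or d < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mu_x, mu_y = mean(x), mean(y)
    # Assumption: variances computed with the (d - 1) denominator.
    var_x, var_y = variance(x), variance(y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    cross = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    return cross / ((d - 1) * math.sqrt(var_x * var_y))


# Hypothetical vectors: y is a scaled copy of x, so they are perfectly correlated.
print(correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

A value close to 1 indicates that the two vectors vary together, a value close to -1 that they vary in opposite directions.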
The partitioning strategy and the grouping strategy are the two major clustering strategies. In the former, a data space is partitioned into a number of subspaces, each of which is a cluster. Each subspace is characterised by a cluster centre. Sometimes the variance or covariance of a cluster is also used as a parameter to describe a subspace. Each data point is labelled by minimising its distance to the cluster centres of a cluster model. Most clustering algorithms that employ the partitioning strategy are parameterised models. This means that the knowledge of how data points are grouped or clustered can be saved in a set of model parameters, such as the cluster centres and the variances. The saved model parameters can then be used for inference on novel data. The K-means algorithm is a typical example.
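The inference step described above, labelling a novel data point by its nearest saved cluster centre, can be sketched as follows. This is not a full clustering algorithm: the centres are hypothetical values standing in for saved model parameters, and the squared Euclidean distance is an assumed choice of distance.

```python
from typing import List, Sequence


def assign_label(point: Sequence[float], centres: Sequence[Sequence[float]]) -> int:
    """Return the index of the cluster centre closest to `point`."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance; an assumed choice for this sketch.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(range(len(centres)), key=lambda k: sq_dist(point, centres[k]))


# Hypothetical saved model parameters: two cluster centres in 2-D.
saved_centres = [(0.0, 0.0), (5.0, 5.0)]

# Novel data points are labelled without revisiting the training data.
novel_points = [(0.5, -0.2), (4.1, 6.0), (2.0, 2.0)]
labels: List[int] = [assign_label(p, saved_centres) for p in novel_points]
print(labels)
```

Only the centres need to be stored for this step; a model that also saves variances or covariances could use them to scale the distance.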